We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all of the features are required to build the model.
# imports
from time import time
import seaborn as sns
import pandas_profiling
from pprint import pprint
import matplotlib.pyplot as plt
import pandas as pd, numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
%matplotlib inline
SIGNAL_DF = pd.read_csv('signal-data.csv') # Analytics Base Table (ABT): Raw dataset
signal_df = SIGNAL_DF.copy()
signal_df.shape # 1567 rows × 592 columns (590 sensor features + Time + Pass/Fail)
# Clear case of Curse of Dimensionality → Dimensionality Reduction / Feature Selection required.
(1567, 592)
signal_df.sample(5)
| Time | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1238 | 2008-02-10 09:10:00 | 3060.00 | 2571.41 | 2199.6556 | 1140.3983 | 1.3369 | 100.0 | 103.0967 | 0.1227 | 1.4300 | ... | 37.6251 | 0.5013 | 0.0095 | 0.0022 | 1.8910 | 0.0193 | 0.0072 | 0.0026 | 37.6251 | 1 |
| 196 | 2008-12-08 06:16:00 | 2967.40 | 2553.04 | 2304.2111 | 1857.8658 | 1.7719 | 100.0 | 96.9967 | 0.1183 | 1.5334 | ... | 104.0012 | 0.4997 | 0.0110 | 0.0030 | 2.1936 | 0.0123 | 0.0128 | 0.0038 | 104.0012 | -1 |
| 250 | 2008-08-18 06:11:00 | 3079.90 | 2463.51 | 2178.7333 | 1039.3641 | 0.7367 | 100.0 | 101.4922 | 0.1219 | 1.5145 | ... | NaN | 0.4978 | 0.0137 | 0.0030 | 2.7444 | 0.0123 | 0.0270 | 0.0079 | 220.0378 | -1 |
| 1256 | 2008-02-10 22:23:00 | 2914.04 | 2487.10 | 2238.1444 | 1580.6951 | 1.0062 | 100.0 | 91.0489 | 0.1230 | 1.4778 | ... | 68.2176 | 0.4928 | 0.0141 | 0.0039 | 2.8615 | 0.0223 | 0.0152 | 0.0043 | 68.2176 | -1 |
| 603 | 2008-08-31 09:11:00 | 3114.92 | 2574.48 | 2200.0666 | 1012.6747 | 1.3954 | 100.0 | 103.0644 | 0.1212 | 1.4207 | ... | 79.7752 | 0.5043 | 0.0128 | 0.0038 | 2.5373 | 0.0199 | 0.0159 | 0.0048 | 79.7752 | -1 |
5 rows × 592 columns
signal_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1567 entries, 0 to 1566 Columns: 592 entries, Time to Pass/Fail dtypes: float64(590), int64(1), object(1) memory usage: 7.1+ MB
signal_df.dtypes
Time object
0 float64
1 float64
2 float64
3 float64
...
586 float64
587 float64
588 float64
589 float64
Pass/Fail int64
Length: 592, dtype: object
signal_df.dtypes.value_counts() # Time is object, Pass/Fail is int64, and the rest are floating-point numbers
float64 590 object 1 int64 1 dtype: int64
signal_df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1561.0 | 3014.452896 | 73.621787 | 2743.2400 | 2966.260000 | 3011.4900 | 3056.6500 | 3356.3500 |
| 1 | 1560.0 | 2495.850231 | 80.407705 | 2158.7500 | 2452.247500 | 2499.4050 | 2538.8225 | 2846.4400 |
| 2 | 1553.0 | 2200.547318 | 29.513152 | 2060.6600 | 2181.044400 | 2201.0667 | 2218.0555 | 2315.2667 |
| 3 | 1553.0 | 1396.376627 | 441.691640 | 0.0000 | 1081.875800 | 1285.2144 | 1591.2235 | 3715.0417 |
| 4 | 1553.0 | 4.197013 | 56.355540 | 0.6815 | 1.017700 | 1.3168 | 1.5257 | 1114.5366 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 586 | 1566.0 | 0.021458 | 0.012358 | -0.0169 | 0.013425 | 0.0205 | 0.0276 | 0.1028 |
| 587 | 1566.0 | 0.016475 | 0.008808 | 0.0032 | 0.010600 | 0.0148 | 0.0203 | 0.0799 |
| 588 | 1566.0 | 0.005283 | 0.002867 | 0.0010 | 0.003300 | 0.0046 | 0.0064 | 0.0286 |
| 589 | 1566.0 | 99.670066 | 93.891919 | 0.0000 | 44.368600 | 71.9005 | 114.7497 | 737.3048 |
| Pass/Fail | 1567.0 | -0.867262 | 0.498010 | -1.0000 | -1.000000 | -1.0000 | -1.0000 | 1.0000 |
591 rows × 8 columns
target = signal_df['Pass/Fail']
signal_df.drop('Pass/Fail', axis = 1, inplace = True)
signal_df.Time
0 2008-07-19 11:55:00
1 2008-07-19 12:32:00
2 2008-07-19 13:17:00
3 2008-07-19 14:43:00
4 2008-07-19 15:22:00
...
1562 2008-10-16 15:13:00
1563 2008-10-16 20:49:00
1564 2008-10-17 05:26:00
1565 2008-10-17 06:01:00
1566 2008-10-17 06:07:00
Name: Time, Length: 1567, dtype: object
signal_df.Time.nunique()
1534
# Redundant variable → Time
# Drop the “Time” variable, which is just a timestamp and adds no predictive value for the target column "Pass/Fail"
signal_df.drop(['Time'], axis = 1, inplace = True)
dropped_cols = ['Time']
print(SIGNAL_DF.shape, '→', signal_df.shape)
(1567, 592) → (1567, 590)
Checking for columns with near-zero variance.
For this we can count the number of unique values in each column; if a column has only one (or very few) unique values, we can drop it, as it holds no information or predictive power for the models to use.
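The idea can be sketched with plain pandas (the toy frame and column names below are made up for illustration):

```python
import pandas as pd

# Toy frame standing in for signal_df (hypothetical values)
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],   # varies -> potentially informative
    'b': [5.0, 5.0, 5.0, 5.0],   # constant -> zero predictive power
    'c': [0.1, 0.1, 0.1, 0.2],   # near-constant -> little information
})

nunique = df.nunique()
constant_cols = nunique[nunique <= 1].index.tolist()  # only 'b'
reduced = df.drop(columns=constant_cols)
```

The same check generalizes to "very few" unique values by raising the `<= 1` cut-off, or by using a variance threshold as done further below.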
# !pip install exploretransform
import exploretransform as et
et.peek(signal_df) # returns dtype, levels, # of observations, and first five observations for a dataframe
| variable | dtype | lvls | obs | head | |
|---|---|---|---|---|---|
| 0 | 0 | float64 | 1520 | 1567 | [3030.93, 3095.78, 2932.61, 2988.72, 3032.24] |
| 1 | 1 | float64 | 1504 | 1567 | [2564.0, 2465.14, 2559.94, 2479.9, 2502.87] |
| 2 | 2 | float64 | 507 | 1567 | [2187.7333, 2230.4222, 2186.4111, 2199.0333, 2... |
| 3 | 3 | float64 | 518 | 1567 | [1411.1265, 1463.6606, 1698.0172, 909.7926, 13... |
| 4 | 4 | float64 | 503 | 1567 | [1.3602, 0.8294, 1.5102, 1.3204, 1.5334] |
| ... | ... | ... | ... | ... | ... |
| 585 | 585 | float64 | 1502 | 1567 | [2.363, 4.4447, 3.1745, 2.0544, 99.3032] |
| 586 | 586 | float64 | 322 | 1567 | [nan, 0.0096, 0.0584, 0.0202, 0.0202] |
| 587 | 587 | float64 | 260 | 1567 | [nan, 0.0201, 0.0484, 0.0149, 0.0149] |
| 588 | 588 | float64 | 120 | 1567 | [nan, 0.006, 0.0148, 0.0044, 0.0044] |
| 589 | 589 | float64 | 611 | 1567 | [nan, 208.2045, 82.8602, 73.8432, 73.8432] |
590 rows × 5 columns
skew_stats = et.skewstats(signal_df)
skew_stats
| dtype | skewness | magnitude | |
|---|---|---|---|
| 119 | float64 | -5.377755 | 2-high |
| 386 | float64 | 17.017257 | 2-high |
| 116 | float64 | -3.282006 | 2-high |
| 388 | float64 | 6.279922 | 2-high |
| 390 | float64 | 39.524898 | 2-high |
| ... | ... | ... | ... |
| 192 | float64 | NaN | 0-approx_symmetric |
| 191 | float64 | NaN | 0-approx_symmetric |
| 190 | float64 | NaN | 0-approx_symmetric |
| 189 | float64 | NaN | 0-approx_symmetric |
| 589 | float64 | NaN | 0-approx_symmetric |
590 rows × 3 columns
skew_stats.magnitude.value_counts() # 44 skewed features: 40 with high skew, 4 with medium
0-approx_symmetric 546 2-high 40 1-medium 4 Name: magnitude, dtype: int64
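A log transform is one common remedy for highly right-skewed features. The notebook does not apply it here; the sketch below uses synthetic non-negative data, since `np.log1p` is only valid for values greater than -1:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature (lognormal draws, seeded for reproducibility)
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

x_log = np.log1p(x)  # compresses the long right tail
print(f'skew before: {x.skew():.2f}, after: {x_log.skew():.2f}')
```

For features containing negative values, a shift or a Yeo-Johnson power transform would be needed instead.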
import missingno as msno # To visualize missing values
msno.matrix(signal_df, figsize=(20, 15)) # White lines indicate missing values
<AxesSubplot:>
signal_df.isna().sum() # Missing values are present in various columns → Have to be treated during missing value treatment
0 6
1 7
2 14
3 14
4 14
..
585 1
586 1
587 1
588 1
589 1
Length: 590, dtype: int64
percent_miss = signal_df.isna().mean()*100 # Percentage of values missing within each of the variables
percent_miss
0 0.382897
1 0.446713
2 0.893427
3 0.893427
4 0.893427
...
585 0.063816
586 0.063816
587 0.063816
588 0.063816
589 0.063816
Length: 590, dtype: float64
percent_miss.value_counts()
0.382897 100 0.063816 92 0.127632 84 0.000000 52 0.574346 48 1.531589 43 0.191449 24 0.255265 24 0.446713 20 0.893427 20 16.592214 12 0.510530 12 64.964901 12 3.254627 8 17.421825 8 0.638162 4 45.628590 4 91.193363 4 60.561583 4 50.670070 4 85.577537 4 0.765795 4 0.319081 3 dtype: int64
temp = signal_df.loc[:, percent_miss > 5] # columns with more than 5% missing values
temp
| 72 | 73 | 85 | 109 | 110 | 111 | 112 | 157 | 158 | 220 | ... | 564 | 565 | 566 | 567 | 568 | 569 | 578 | 579 | 580 | 581 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.0096 | 0.0201 | 0.0060 | 208.2045 |
| 2 | 140.6972 | 485.2665 | NaN | NaN | NaN | NaN | 0.4684 | NaN | NaN | NaN | ... | 1.10 | 0.6219 | 0.4122 | 0.2562 | 0.4119 | 68.8489 | 0.0584 | 0.0484 | 0.0148 | 82.8602 |
| 3 | 160.3210 | 464.9735 | NaN | NaN | NaN | NaN | 0.4647 | NaN | NaN | NaN | ... | 7.32 | 0.1630 | 3.5611 | 0.0670 | 2.7290 | 25.0363 | 0.0202 | 0.0149 | 0.0044 | 73.8432 |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1562 | NaN | NaN | NaN | 0.9833 | 102.0542 | 231.1404 | NaN | NaN | NaN | NaN | ... | 4.98 | 0.0877 | 2.0902 | 0.0382 | 1.8844 | 15.4662 | 0.0068 | 0.0138 | 0.0047 | 203.1720 |
| 1563 | 158.7832 | 463.3486 | NaN | 0.9824 | 97.5426 | 235.2582 | NaN | NaN | NaN | NaN | ... | 4.56 | 0.1308 | 1.7420 | 0.0495 | 1.7089 | 20.9118 | NaN | NaN | NaN | NaN |
| 1564 | NaN | NaN | 0.1119 | 0.9839 | 101.4167 | 231.2249 | NaN | NaN | NaN | 0.0089 | ... | 11.09 | 0.2388 | 4.4128 | 0.0965 | 4.3197 | 29.0954 | 0.0197 | 0.0086 | 0.0025 | 43.5231 |
| 1565 | NaN | NaN | NaN | 0.9828 | 101.3445 | 233.0335 | NaN | NaN | NaN | NaN | ... | 4.98 | 0.0877 | 2.0902 | 0.0382 | 1.8844 | 15.4662 | 0.0262 | 0.0245 | 0.0075 | 93.4941 |
| 1566 | NaN | NaN | NaN | 0.9814 | 102.3153 | 231.3263 | NaN | NaN | NaN | NaN | ... | 8.42 | 0.1307 | 3.0894 | 0.0493 | 3.2639 | 21.1128 | 0.0117 | 0.0162 | 0.0045 | 137.7844 |
1567 rows × 52 columns
msno.matrix(temp, figsize=(20, 15)) # (White lines indicate missing values)
# A few columns have a high number of missing values! → Remove such variables as they can't be imputed reliably
<AxesSubplot:>
pd.DataFrame(percent_miss[percent_miss > 5].sort_values(ascending = False))
| 0 | |
|---|---|
| 293 | 91.193363 |
| 157 | 91.193363 |
| 158 | 91.193363 |
| 292 | 91.193363 |
| 85 | 85.577537 |
| 492 | 85.577537 |
| 220 | 85.577537 |
| 358 | 85.577537 |
| 517 | 64.964901 |
| 516 | 64.964901 |
| 384 | 64.964901 |
| 383 | 64.964901 |
| 382 | 64.964901 |
| 245 | 64.964901 |
| 246 | 64.964901 |
| 518 | 64.964901 |
| 244 | 64.964901 |
| 111 | 64.964901 |
| 110 | 64.964901 |
| 109 | 64.964901 |
| 580 | 60.561583 |
| 581 | 60.561583 |
| 579 | 60.561583 |
| 578 | 60.561583 |
| 72 | 50.670070 |
| 346 | 50.670070 |
| 73 | 50.670070 |
| 345 | 50.670070 |
| 247 | 45.628590 |
| 112 | 45.628590 |
| 385 | 45.628590 |
| 519 | 45.628590 |
| 568 | 17.421825 |
| 563 | 17.421825 |
| 567 | 17.421825 |
| 566 | 17.421825 |
| 569 | 17.421825 |
| 565 | 17.421825 |
| 564 | 17.421825 |
| 562 | 17.421825 |
| 547 | 16.592214 |
| 546 | 16.592214 |
| 556 | 16.592214 |
| 555 | 16.592214 |
| 554 | 16.592214 |
| 553 | 16.592214 |
| 552 | 16.592214 |
| 551 | 16.592214 |
| 550 | 16.592214 |
| 549 | 16.592214 |
| 548 | 16.592214 |
| 557 | 16.592214 |
plt.figure(figsize=(15, 20))
percent_miss[percent_miss > 5].sort_values(ascending = True).plot(kind = 'barh', color = 'gray')
plt.show()
To deal with missing values, we examine missingness by observation and by variable. The figure above shows that most variables have no more than 17.42% missing values. However, high percentages of missing values do occur in some variables, ranging from 45.62% to 91.19%. Thus, we drop the variables containing more than 40% missing values and impute the remaining ones.
There are several methods for missing-value imputation, such as k-nearest neighbours, regression, and random forests. K-nearest neighbours (KNN) is a natural improvement over mean imputation that exploits the observed data structure. Multivariate Imputation by Chained Equations (MICE) is based on a much more complex algorithm; its behaviour appears to be related to the size of the dataset, and it becomes time-intensive on large datasets. So we use KNN to impute the missing values, as it is a relatively efficient algorithm for high-dimensional data.
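As a small illustration of the trade-off (the column values below are made up), scikit-learn's `IterativeImputer` is a MICE-style imputer that can be compared with `KNNImputer` on a toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Toy frame where b ~ 2*a, with one missing cell (hypothetical values)
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [2.0, 4.0, np.nan, 8.0]})

knn_filled = KNNImputer(n_neighbors=2).fit_transform(df)
mice_filled = IterativeImputer(random_state=0).fit_transform(df)
# Both recover a value near 6 for the missing cell: MICE by regressing
# b on a, KNN by averaging b over the two nearest rows in 'a'.
```

On this tiny example both agree; on a large, high-dimensional table like ours the iterative regressions of MICE become far more expensive, which motivates the KNN choice.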
np.mean(percent_miss > 40)*100 # ~5.4% of the variables have more than 40% of their values missing and will be dropped
5.423728813559322
signal_df_dropped = signal_df.loc[:, percent_miss < 40]
print(signal_df.shape, '→', signal_df_dropped.shape)
(1567, 590) → (1567, 558)
filtered_cols = list(signal_df_dropped.columns)
len(filtered_cols)
558
# impute using KNN Imputer
# split into train and validation set before imputation during training to prevent data leakage
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors = 7) # n_neighbors is a hyperparameter, could be tuned during grid search
X_imputed = imputer.fit_transform(signal_df_dropped)
X_imputed.shape
(1567, 558)
signal_df_imputed = pd.DataFrame(X_imputed, columns = signal_df_dropped.columns)
signal_df_imputed.isna().sum().any() # No more missing values!
False
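Per the note above about data leakage, the leakage-safe pattern is to fit the imputer on the training split only and reuse it on the validation split. A minimal sketch on synthetic data (the frame and column names are stand-ins, not the SECOM data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for signal_df_dropped
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=list('abcd'))
X.iloc[::10, 1] = np.nan  # sprinkle some missing values

X_train, X_val = train_test_split(X, test_size=0.2, random_state=42)

imputer = KNNImputer(n_neighbors=7)
X_train_imp = imputer.fit_transform(X_train)  # neighbours come from train only
X_val_imp = imputer.transform(X_val)          # validation rows never influence the fit
```

Wrapping the imputer in a `Pipeline` with the downstream estimator achieves the same guarantee automatically during cross-validation.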
# Feature selector that removes all low-variance features
from sklearn.feature_selection import VarianceThreshold
threshold_n=0.95
var_selector = VarianceThreshold(threshold=(threshold_n*(1-threshold_n)))
# Features with a training-set variance lower than this threshold will be removed.
# The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples.
X_selected = var_selector.fit_transform(signal_df_imputed)
print(signal_df_imputed.shape, '→', X_selected.shape)
(1567, 558) → (1567, 269)
signal_df_var_selcted = signal_df_imputed[signal_df_imputed.columns[var_selector.get_support(indices=True)]]
print(signal_df_imputed.shape, '→', signal_df_var_selcted.shape)
(1567, 558) → (1567, 269)
signal_df_var_selcted.sample(5) # remaining features after removing near-zero-variance columns
| 0 | 1 | 2 | 3 | 4 | 6 | 12 | 14 | 15 | 16 | ... | 569 | 570 | 571 | 572 | 573 | 574 | 576 | 577 | 585 | 589 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 145 | 3095.50 | 2394.58 | 2196.6555 | 1066.1908 | 1.2188 | 101.8900 | 197.5659 | 12.0865 | 407.8535 | 10.0980 | ... | 18.3453 | 536.2418 | 2.4342 | 8.74 | 0.4067 | 3.1845 | 1.6299 | 16.7075 | 2.6501 | 397.5003 |
| 1359 | 3005.90 | 2534.68 | 2231.0555 | 1303.5386 | 0.9751 | 95.7878 | 198.9829 | 10.3404 | 424.4315 | 9.8089 | ... | 34.7443 | 532.1382 | 2.0329 | 9.02 | 0.2568 | 3.0501 | 1.6951 | 12.6333 | 2.0438 | 193.4633 |
| 743 | 2996.90 | 2448.07 | 2162.7556 | 1041.1557 | 0.8479 | 107.2622 | 198.3266 | 8.4330 | 406.5568 | 10.4182 | ... | 12.2486 | 532.1391 | 2.3488 | 9.44 | 0.4671 | 3.0702 | 1.7740 | 19.8852 | 3.3497 | 59.9813 |
| 396 | 3127.07 | 2478.97 | 2198.7222 | 1534.2053 | 0.9374 | 104.1989 | 200.5570 | 6.5375 | 401.5945 | 9.9879 | ... | 18.0932 | 534.4473 | 2.4356 | 9.33 | 0.1854 | 3.3165 | 1.7457 | 7.6120 | 1.7695 | 86.7035 |
| 1325 | 3193.53 | 2587.86 | 2162.1333 | 998.9095 | 0.8826 | 104.9722 | 202.4111 | 5.6064 | 412.9307 | 9.9604 | ... | 15.4662 | 531.3445 | 2.0359 | 5.71 | 0.2796 | 1.5650 | 1.0746 | 13.7332 | 2.1296 | 91.4264 |
5 rows × 269 columns
# check for highly correlated variables
N = signal_df_var_selcted.select_dtypes('number').copy()
tic = time()
c = et.corrtable(N, cut = 0.5, full= True, methodx = 'pearson')
toc = time()
c
| v1 | v2 | v1.target | v2.target | corr | drop | |
|---|---|---|---|---|---|---|
| 22559 | 209 | 347 | 0.075945 | 0.075945 | 1.000000 | 209 |
| 22619 | 209 | 478 | 0.075945 | 0.075945 | 1.000000 | 209 |
| 28845 | 347 | 478 | 0.075945 | 0.075945 | 1.000000 | 478 |
| 5419 | 34 | 36 | 0.040723 | 0.040723 | 1.000000 | 36 |
| 15794 | 140 | 275 | 0.038780 | 0.038780 | 1.000000 | 140 |
| ... | ... | ... | ... | ... | ... | ... |
| 9383 | 60 | 425 | 0.066404 | 0.056380 | 0.000009 | |
| 21627 | 202 | 486 | 0.117651 | 0.025268 | 0.000008 | |
| 25043 | 285 | 340 | 0.037899 | 0.114714 | 0.000007 | |
| 12463 | 96 | 287 | 0.028355 | 0.054914 | 0.000005 | |
| 34793 | 489 | 545 | 0.024411 | 0.046656 | 0.000002 |
36046 rows × 6 columns
print(f'corr_filter fit time: {(toc-tic):.2f}s')
corr_filter fit time: 150.75s
# Based on the output of corrtable(), calcdrop() determines which features should be dropped.
correlated_cols = et.calcdrop(c)
pprint(correlated_cols, compact = True) # to drop
['388', '185', '269', '207', '316', '197', '431', '343', '420', '151', '363', '577', '205', '188', '470', '28', '36', '34', '225', '333', '21', '139', '155', '289', '430', '148', '576', '90', '479', '459', '181', '51', '154', '67', '480', '223', '142', '338', '347', '16', '283', '421', '63', '435', '406', '180', '277', '167', '50', '271', '177', '539', '339', '340', '555', '428', '32', '65', '335', '209', '83', '413', '474', '427', '425', '198', '62', '166', '196', '423', '457', '456', '549', '553', '165', '301', '270', '252', '455', '550', '70', '182', '572', '573', '27', '295', '96', '274', '344', '55', '285', '60', '12', '150', '361', '475', '321', '337', '287', '164', '138', '434', '183', '476', '522', '136', '199', '125', '566', '300', '410', '318', '6', '524', '551', '319', '554', '61', '323', '152', '187', '202', '556', '140', '46', '477', '341', '336', '294', '39', '201', '203', '195', '286', '552', '440', '490', '66', '272', '137', '4', '439', '159', '471', '541', '436', '411', '469', '415', '297', '218', '332', '128', '135', '564', '540', '117', '478', '204', '45', '296', '497']
print(len(correlated_cols))
162
print(len(signal_df_var_selcted.columns) - len(correlated_cols))
107
signal_df_selected = signal_df_var_selcted[list(set(signal_df_var_selcted.columns) - set(correlated_cols))]
print(signal_df_var_selcted.shape, '→', signal_df_selected.shape)
(1567, 269) → (1567, 107)
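`corrtable()` took ~150s here; a common pure-pandas alternative is a greedy upper-triangle filter (note this is not the same drop rule as `calcdrop()`, which also weighs correlation with the target). The toy values below are made up:

```python
import numpy as np
import pandas as pd

# Toy frame: 'b' nearly duplicates 'a', 'c' is largely independent
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'b': [2.0, 4.1, 5.9, 8.0, 10.2],
                   'c': [5.0, 1.0, 4.0, 2.0, 3.0]})

corr = df.corr().abs()
# keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.5).any()]
reduced = df.drop(columns=to_drop)
```

With the 0.5 cut-off used above, 'b' is dropped as redundant with 'a' while 'c' survives.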
signal_df_selected.sample(5)
| 500 | 37 | 511 | 22 | 407 | 482 | 126 | 416 | 468 | 546 | ... | 526 | 485 | 15 | 59 | 448 | 489 | 438 | 43 | 390 | 460 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1548 | 0.0 | 65.8936 | 659.6491 | 2350.50 | 1.0918 | 395.773143 | 2.454 | 5.4282 | 579.4460 | 1.451800 | ... | 1.5884 | 197.848629 | 413.0515 | 2.9545 | 0.6701 | 401.898843 | 35.7616 | 351.2718 | 0.8128 | 26.1713 |
| 1412 | 0.0 | 66.0377 | 558.4906 | 2689.25 | 1.1919 | 152.568600 | 2.673 | 2.6741 | 425.0778 | 1.185100 | ... | 2.2616 | 645.945900 | 423.8394 | -1.4609 | 0.2510 | 379.874200 | 124.7525 | 363.9527 | 0.4559 | 26.5800 |
| 84 | 0.0 | 66.5600 | 833.3333 | 2259.25 | 1.1227 | 190.294100 | 2.914 | 4.7292 | 68.3330 | 1.556900 | ... | 0.4537 | 117.872700 | 397.9443 | 8.6927 | 0.9135 | 131.040700 | 22.7848 | 354.4182 | 1.6373 | 38.0561 |
| 97 | 0.0 | 65.9507 | 0.0000 | 2598.75 | 1.6137 | 221.753500 | 2.643 | 2.9451 | 32.8258 | 1.367800 | ... | 2.1506 | 71.838100 | 422.5091 | 28.4227 | 0.9581 | 118.189000 | 52.7197 | 358.3600 | 0.8183 | 28.5726 |
| 541 | 0.0 | 66.3572 | 968.4211 | 2488.00 | 1.4918 | 0.000000 | 2.708 | 3.3968 | 335.2623 | 1.225343 | ... | 1.2350 | 373.148100 | 406.0384 | 1.1782 | 0.7905 | 0.000000 | 47.7876 | 355.0564 | 0.8080 | 17.1475 |
5 rows × 107 columns
# Finally after dimensionality reduction using various methods
print(SIGNAL_DF.shape, '→', signal_df_selected.shape)
(1567, 592) → (1567, 107)
selected_cols = signal_df_selected.columns
pprint(selected_cols, compact = True) # to be used for test sets to select features
Index(['500', '37', '511', '22', '407', '482', '126', '416', '468', '546',
...
'526', '485', '15', '59', '448', '489', '438', '43', '390', '460'],
dtype='object', length=107)
et.explore(signal_df_selected) # p_na: percentage of na, p_inf: percentage of inf, p_zer: percentage of zeroes; q: quantity
| variable | obs | q_zer | p_zer | q_na | p_na | q_inf | p_inf | dtype | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 546 | 1567 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
| 1 | 31 | 1567 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
| 2 | 3 | 1567 | 1 | 0.06 | 0 | 0.0 | 0 | 0.0 | float64 |
| 3 | 356 | 1567 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
| 4 | 454 | 1567 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 102 | 275 | 1567 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
| 103 | 491 | 1567 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
| 104 | 408 | 1567 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
| 105 | 426 | 1567 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
| 106 | 14 | 1567 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
107 rows × 9 columns
signal_df_selected.describe().T.style.background_gradient('Greens')
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| 546 | 1567.000000 | 1.046603 | 0.360570 | 0.444400 | 0.814100 | 0.990900 | 1.250757 | 3.978600 |
| 31 | 1567.000000 | 3.673307 | 0.535026 | 2.069800 | 3.362700 | 3.431400 | 3.533500 | 4.804400 |
| 3 | 1567.000000 | 1397.226606 | 440.538934 | 0.000000 | 1083.885800 | 1287.353800 | 1591.223500 | 3715.041700 |
| 356 | 1567.000000 | 1.298650 | 0.386795 | 0.379600 | 1.025650 | 1.255400 | 1.533250 | 2.834800 |
| 454 | 1567.000000 | 7.882594 | 3.059021 | 2.329400 | 5.808150 | 7.421900 | 9.576750 | 42.070300 |
| 409 | 1567.000000 | 4.581990 | 1.773896 | 1.216300 | 3.019000 | 4.497700 | 5.935150 | 9.576500 |
| 419 | 1567.000000 | 309.106709 | 325.267266 | 0.000000 | 0.000000 | 272.448700 | 582.803100 | 998.681300 |
| 521 | 1567.000000 | 11.610080 | 103.122996 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1000.000000 |
| 523 | 1567.000000 | 0.453896 | 4.147581 | 0.025800 | 0.073050 | 0.100000 | 0.133200 | 111.333000 |
| 570 | 1567.000000 | 530.523623 | 17.499736 | 317.196400 | 530.702700 | 532.398200 | 534.356400 | 589.508200 |
| 495 | 1567.000000 | 6.807826 | 3.260019 | 1.772000 | 5.274600 | 6.607900 | 7.897200 | 107.692600 |
| 437 | 1567.000000 | 4.017883 | 1.610538 | 1.156800 | 3.071700 | 3.783800 | 4.681250 | 32.274000 |
| 493 | 1567.000000 | 2.530046 | 0.973948 | 0.833000 | 1.663750 | 2.529100 | 3.199100 | 9.402400 |
| 268 | 1567.000000 | 19.500567 | 7.327127 | 6.098000 | 13.828000 | 17.977000 | 24.653000 | 40.855000 |
| 208 | 1567.000000 | 73.268036 | 28.015208 | 5.359000 | 56.220500 | 73.313000 | 90.452500 | 172.349000 |
| 484 | 1567.000000 | 214.351134 | 211.710597 | 0.000000 | 76.866900 | 138.928600 | 288.918450 | 996.858600 |
| 562 | 1567.000000 | 262.765783 | 7.011110 | 242.286000 | 260.697000 | 264.272000 | 265.266714 | 311.404000 |
| 22 | 1567.000000 | 2699.425289 | 295.320524 | 0.000000 | 2578.125000 | 2664.000000 | 2840.625000 | 3656.250000 |
| 23 | 1567.000000 | -3806.280715 | 1379.354358 | -9986.750000 | -4370.625000 | -3820.750000 | -3356.375000 | 2363.000000 |
| 324 | 1567.000000 | 13.332585 | 6.613757 | 2.234500 | 7.578800 | 12.510900 | 17.925150 | 51.867800 |
| 424 | 1567.000000 | 3.314650 | 6.319353 | 0.486600 | 1.965450 | 2.667400 | 3.469200 | 103.180900 |
| 29 | 1567.000000 | 2.366140 | 0.408454 | 0.666700 | 2.088900 | 2.377800 | 2.655600 | 3.511100 |
| 547 | 1567.000000 | 403.836020 | 4.725650 | 372.822000 | 400.814000 | 404.268000 | 407.146714 | 421.702000 |
| 585 | 1567.000000 | 3.067640 | 3.576898 | 1.197500 | 2.306500 | 2.757700 | 3.294950 | 99.303200 |
| 37 | 1567.000000 | 66.221194 | 0.304060 | 64.919300 | 66.040800 | 66.231400 | 66.343050 | 67.958600 |
| 48 | 1567.000000 | 139.972935 | 4.522892 | 125.798200 | 136.930000 | 140.010000 | 143.194100 | 163.250900 |
| 473 | 1567.000000 | 39.393458 | 22.415410 | 0.000000 | 24.933350 | 34.198000 | 47.690150 | 358.950400 |
| 41 | 1567.000000 | 3.348107 | 2.343121 | -0.075900 | 2.694000 | 3.074000 | 3.518000 | 37.880000 |
| 569 | 1567.000000 | 21.015259 | 9.390705 | 3.250400 | 15.466200 | 18.337186 | 24.213000 | 84.802400 |
| 561 | 1567.000000 | 32.284146 | 19.020033 | 7.236900 | 15.766900 | 29.780100 | 44.113400 | 101.114600 |
| 115 | 1567.000000 | 747.383792 | 48.949250 | 544.025400 | 721.023000 | 750.861400 | 776.781850 | 924.531800 |
| 527 | 1567.000000 | 6.395717 | 1.888698 | 2.170000 | 4.895450 | 6.410800 | 7.594250 | 14.447900 |
| 250 | 1567.000000 | 109.650967 | 54.597274 | 21.010700 | 76.132150 | 103.093600 | 131.758400 | 1119.704200 |
| 273 | 1567.000000 | 20.171912 | 3.818709 | 8.651200 | 18.247100 | 19.563000 | 22.089100 | 43.573700 |
| 18 | 1567.000000 | 190.047643 | 2.778941 | 169.177400 | 188.300650 | 189.666700 | 192.178900 | 215.597700 |
| 161 | 1567.000000 | 4065.065275 | 4237.007385 | 0.000000 | 1322.500000 | 2614.000000 | 5033.000000 | 37943.000000 |
| 526 | 1567.000000 | 1.443457 | 0.958428 | 0.170500 | 0.484200 | 1.550100 | 2.211650 | 8.203700 |
| 133 | 1567.000000 | 1004.039789 | 6.522788 | 980.451000 | 999.996100 | 1004.050000 | 1008.670600 | 1020.994400 |
| 589 | 1567.000000 | 99.695697 | 93.867420 | 0.000000 | 44.368600 | 72.023000 | 114.749700 | 737.304800 |
| 442 | 1567.000000 | 1.345209 | 0.658987 | 0.097400 | 0.908800 | 1.266200 | 1.577550 | 5.131700 |
| 482 | 1567.000000 | 319.201752 | 279.048023 | 0.000000 | 0.000000 | 296.967200 | 512.390750 | 999.413500 |
| 59 | 1567.000000 | 2.961586 | 9.514338 | -28.988200 | -1.855450 | 0.951800 | 4.395500 | 168.145500 |
| 40 | 1567.000000 | 67.928227 | 23.924263 | 1.434000 | 74.570000 | 78.290000 | 80.200000 | 86.120000 |
| 390 | 1567.000000 | 1.431868 | 20.326415 | 0.304600 | 0.675150 | 0.877300 | 1.148200 | 805.393600 |
| 548 | 1567.000000 | 75.981711 | 3.221882 | 71.038000 | 73.254000 | 74.722000 | 78.446714 | 83.720000 |
| 38 | 1567.000000 | 86.836519 | 0.446619 | 84.732700 | 86.578300 | 86.820700 | 87.002400 | 88.418800 |
| 160 | 1567.000000 | 555.102015 | 574.486495 | 0.000000 | 295.000000 | 437.000000 | 624.500000 | 4170.000000 |
| 126 | 1567.000000 | 2.750794 | 0.252763 | 2.340000 | 2.574000 | 2.735000 | 2.873000 | 3.991000 |
| 453 | 1567.000000 | 5.460434 | 2.250186 | 0.903700 | 3.748750 | 5.227000 | 6.898750 | 34.490200 |
| 568 | 1567.000000 | 2.521479 | 0.929889 | 0.370600 | 1.884400 | 2.354100 | 3.015750 | 12.746200 |
| 499 | 1567.000000 | 263.175478 | 324.564638 | 0.000000 | 0.000000 | 0.000000 | 536.122600 | 1000.000000 |
| 407 | 1567.000000 | 1.231588 | 0.363973 | 0.424000 | 0.966500 | 1.237400 | 1.416700 | 3.312800 |
| 129 | 1567.000000 | -0.553561 | 1.217799 | -3.779000 | -0.898800 | -0.141900 | 0.047300 | 2.458000 |
| 496 | 1567.000000 | 29.850851 | 24.264188 | 4.813500 | 16.486550 | 22.412500 | 32.446250 | 219.643600 |
| 433 | 1567.000000 | 205.454536 | 225.667412 | 0.000000 | 10.047450 | 151.115600 | 304.541800 | 995.744700 |
| 494 | 1567.000000 | 0.956442 | 6.615200 | 0.034200 | 0.139000 | 0.232500 | 0.563000 | 127.572800 |
| 416 | 1567.000000 | 3.399034 | 1.038568 | 0.000000 | 2.659100 | 3.234000 | 4.010700 | 9.690000 |
| 134 | 1567.000000 | 39.389030 | 2.983610 | 33.365800 | 37.368900 | 38.902600 | 40.804600 | 64.128700 |
| 486 | 1567.000000 | 301.831510 | 285.447661 | 0.000000 | 0.000000 | 249.162000 | 497.384500 | 999.491100 |
| 571 | 1567.000000 | 2.101836 | 0.275112 | 0.980200 | 1.982900 | 2.118600 | 2.290650 | 2.739500 |
| 574 | 1567.000000 | 9.162315 | 26.920150 | 1.039500 | 2.567850 | 2.975800 | 3.492500 | 170.020400 |
| 71 | 1567.000000 | 104.349908 | 31.596457 | 21.433200 | 87.584600 | 102.619800 | 115.556400 | 238.477500 |
| 448 | 1567.000000 | 0.332238 | 0.236203 | 0.039900 | 0.187700 | 0.251200 | 0.351100 | 1.475400 |
| 24 | 1567.000000 | -297.321390 | 2901.069068 | -14804.500000 | -1474.375000 | -74.000000 | 1376.250000 | 14106.000000 |
| 485 | 1567.000000 | 200.801292 | 217.247592 | 0.000000 | 51.081400 | 113.781300 | 284.802450 | 994.000000 |
| 489 | 1567.000000 | 272.189763 | 226.615115 | 0.000000 | 113.806650 | 219.948800 | 375.896800 | 994.003500 |
| 511 | 1567.000000 | 276.038085 | 329.473422 | 0.000000 | 0.000000 | 0.000000 | 554.010700 | 1000.000000 |
| 1 | 1567.000000 | 2495.866816 | 80.242176 | 2158.750000 | 2452.885000 | 2499.460000 | 2538.745000 | 2846.440000 |
| 162 | 1567.000000 | 4797.633421 | 6549.623689 | 0.000000 | 451.000000 | 1787.000000 | 6424.000000 | 36871.000000 |
| 64 | 1567.000000 | 20.543352 | 4.967410 | 6.448200 | 17.377300 | 20.030900 | 22.807750 | 48.988200 |
| 557 | 1567.000000 | 1.630171 | 1.731521 | 0.164600 | 1.214900 | 1.559029 | 1.815600 | 54.291700 |
| 33 | 1567.000000 | 8.960046 | 1.344058 | 7.603200 | 8.580000 | 8.769600 | 9.060600 | 23.345300 |
| 472 | 1567.000000 | 137.879200 | 47.601061 | 11.499700 | 105.622150 | 138.250800 | 168.270250 | 492.771800 |
| 418 | 1567.000000 | 320.336860 | 287.529022 | 0.000000 | 0.000000 | 302.310800 | 523.624450 | 999.316000 |
| 290 | 1567.000000 | 0.123300 | 0.270127 | 0.041600 | 0.065200 | 0.083900 | 0.118100 | 4.420300 |
| 98 | 1567.000000 | -0.017954 | 0.426327 | -5.271700 | -0.217300 | 0.000000 | 0.188650 | 2.569800 |
| 15 | 1567.000000 | 413.085946 | 17.205084 | 333.448600 | 406.131000 | 412.228500 | 419.082800 | 824.927100 |
| 122 | 1567.000000 | 3.897743 | 0.902104 | 1.671000 | 3.202000 | 3.865000 | 4.392000 | 6.889000 |
| 500 | 1567.000000 | 240.862924 | 322.820228 | 0.000000 | 0.000000 | 0.000000 | 505.225750 | 999.233700 |
| 429 | 1567.000000 | 4.171844 | 6.435390 | 0.783700 | 2.571400 | 3.453800 | 4.755800 | 186.616400 |
| 467 | 1567.000000 | 6.250845 | 8.662697 | 1.716300 | 4.697500 | 5.643400 | 6.386900 | 109.007400 |
| 2 | 1567.000000 | 2200.573421 | 29.389054 | 2060.660000 | 2181.099950 | 2201.066700 | 2218.055500 | 2315.266700 |
| 417 | 1567.000000 | 8.190237 | 4.051974 | 2.153100 | 5.767150 | 7.396300 | 9.167850 | 39.037600 |
| 432 | 1567.000000 | 99.392565 | 126.145255 | 0.000000 | 31.033850 | 57.969300 | 120.363100 | 994.285700 |
| 488 | 1567.000000 | 351.931026 | 250.398090 | 0.000000 | 145.156850 | 347.486900 | 507.497050 | 997.518600 |
| 525 | 1567.000000 | 5.567225 | 3.893827 | 1.540000 | 4.117100 | 5.138900 | 6.332500 | 80.040600 |
| 0 | 1567.000000 | 3014.272973 | 73.550441 | 2743.240000 | 2965.965000 | 3010.920000 | 3056.540000 | 3356.350000 |
| 438 | 1567.000000 | 54.699440 | 34.086365 | 0.000000 | 36.343700 | 49.090900 | 66.666700 | 851.612900 |
| 452 | 1567.000000 | 5.346705 | 0.918913 | 2.670900 | 4.765400 | 5.271400 | 5.912900 | 13.977600 |
| 88 | 1567.000000 | 1807.815021 | 53.537262 | 1627.471400 | 1777.470300 | 1809.249200 | 1841.873000 | 2105.182300 |
| 43 | 1567.000000 | 355.538804 | 6.232716 | 342.754500 | 350.802250 | 353.727300 | 360.771800 | 377.297300 |
| 460 | 1567.000000 | 29.194749 | 13.331348 | 7.953400 | 20.224100 | 26.164400 | 35.268400 | 149.385100 |
| 412 | 1567.000000 | 30.907457 | 18.336589 | 0.000000 | 18.470300 | 26.156900 | 38.026900 | 128.281600 |
| 35 | 1567.000000 | 64.555503 | 2.573951 | 63.677400 | 64.024800 | 64.165800 | 64.344700 | 94.264100 |
| 68 | 1567.000000 | 147.439787 | 4.232628 | 87.025500 | 145.242300 | 147.597300 | 149.935900 | 167.830900 |
| 200 | 1567.000000 | 17.602412 | 8.671406 | 3.210000 | 14.175000 | 17.280000 | 20.160000 | 199.620000 |
| 520 | 1567.000000 | 2.695999 | 5.702366 | 0.312100 | 1.552150 | 2.221000 | 2.903700 | 111.736500 |
| 545 | 1567.000000 | 7.611767 | 1.314755 | 4.429400 | 7.116000 | 7.116000 | 8.021500 | 21.044300 |
| 468 | 1567.000000 | 223.999726 | 230.296010 | 0.000000 | 38.882650 | 150.465400 | 334.674000 | 999.877000 |
| 483 | 1567.000000 | 206.337023 | 191.500719 | 0.000000 | 82.410150 | 149.309643 | 260.680050 | 989.473700 |
| 510 | 1567.000000 | 55.768588 | 37.668313 | 0.000000 | 35.324400 | 47.058800 | 64.315100 | 451.485100 |
| 487 | 1567.000000 | 239.421161 | 262.174549 | 0.000000 | 57.275700 | 114.507600 | 397.063400 | 995.744700 |
| 275 | 1567.000000 | 8.920392 | 168.608489 | 0.011100 | 0.044800 | 0.078600 | 0.144900 | 3332.596400 |
| 491 | 1567.000000 | 2.443371 | 1.219949 | 0.555800 | 1.747100 | 2.250800 | 2.839800 | 12.769800 |
| 408 | 1567.000000 | 5.341513 | 2.575156 | 2.737800 | 4.127800 | 4.921900 | 5.787100 | 44.310000 |
| 426 | 1567.000000 | 1.233868 | 0.994736 | 0.363200 | 0.744200 | 1.135500 | 1.538850 | 24.990400 |
| 14 | 1567.000000 | 9.005815 | 2.794133 | 2.249300 | 7.096750 | 8.970300 | 10.858700 | 19.546500 |
We can see that most of the features have mean approximately equal to the median, implying that these features follow a normal/Gaussian distribution, which is helpful.
# Distribution of Pass/Fail classes in the dataset
pos = len(SIGNAL_DF[SIGNAL_DF['Pass/Fail'] == -1])
neg = len(SIGNAL_DF[SIGNAL_DF['Pass/Fail'] == 1])
plt.pie(x=[pos, neg], explode=(0, 0), labels=['Pass', 'Fail'], autopct='%1.2f%%', \
shadow=True, startangle=90,colors = ['#ffac81','#5aa2ce'], normalize = True)
fig = plt.gcf()
fig.set_size_inches(6,6)
plt.show()
print(f'Pass: {pos}\nFail: {neg}')
Pass: 1463 Fail: 104
The SECOM (Semiconductor Manufacturing) dataset consists of manufacturing-operation data and semiconductor quality data. It contains 1567 observations taken from a wafer fabrication production line. Each observation is a vector of 590 sensor measurements plus a pass/fail label. There are only 104 fail cases, labelled as positive (encoded as 1), whereas a much larger number of examples pass the test and are labelled as negative (encoded as -1). This is roughly a 1:14 ratio (6.64% positives), which is heavily imbalanced, so we will have to use upsampling techniques so that the models aren't biased towards the Pass class.
# install sweetviz for Auto EDA
# !pip install sweetviz
# or
# !conda install -c conda-forge sweetviz
import sweetviz as sv
uni_report = sv.analyze(signal_df_selected, pairwise_analysis='off')
uni_report.show_notebook()
A considerable number of features still have relatively low variance (few distinct values relative to the number of observations). Many features have Gaussian distributions, while a few have non-standard distributions.
temp_df = signal_df_selected.copy(deep = True)
temp_df['target'] = target.replace({-1: 0}).astype(bool)
bi_report = sv.analyze(temp_df, target_feat = 'target', pairwise_analysis = 'off')
bi_report.show_notebook()